Support larger alphabets and k via generic kmer_t #39
Conversation
Hi @adamant-pwn and thanks for this! I quickly skimmed through the commits and the code seems really well done. So, as far as I see, you use a template for the internal storage of a kmer. Along the same lines, do you think one could use __uint256_t? As far as I know, it is not yet standard nor well supported as an "integer" data type, since it is actually a vector... I'm curious to see what could happen with, say, k=96 or k=127. The space of SSHash should improve a lot! Thanks,
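For context, under the 2-bits-per-character DNA encoding used here, a k-mer occupies 2*k bits, which is what determines how wide the kmer storage type must be for the values of k mentioned above. A small self-contained illustration (my own, not code from the PR):

```cpp
// Illustration only (not code from the PR): under the usual 2-bit DNA
// encoding, a k-mer needs 2*k bits, which determines how wide kmer_t must be.
#include <cstdint>
#include <iostream>

constexpr unsigned bits_per_char = 2;  // A, C, G, T

constexpr unsigned bits_needed(unsigned k) { return bits_per_char * k; }

int main() {
    for (unsigned k : {31u, 63u, 96u, 127u}) {
        const char* storage = bits_needed(k) <= 64    ? "fits in uint64_t"
                              : bits_needed(k) <= 128 ? "fits in __uint128_t"
                                                      : "needs a 256-bit type";
        std::cout << "k = " << k << " -> " << bits_needed(k) << " bits ("
                  << storage << ")\n";
    }
    return 0;
}
```

So k=63 is the largest value that fits in __uint128_t, while k=96 or k=127 would require some 256-bit representation.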
Thanks for the pointers, I will go over the datasets that you shared and get back with results here. In the meantime, could you elaborate a bit on which __uint256_t you have in mind? Theoretically, I can also make a few changes to support it. Also, do you have any datasets to check against 256-bit kmers?
Hi @adamant-pwn, You're right, sorry for the confusion, I was referring to
-Giulio
Hi @jermp, I tried running build on human.k63 with a 256-bit type. In the meantime, please see below the results for human.k63 with the 128-bit type:

Bench log with __uint128_t
k = 63, m = 31, seed = 1, l = 6, c = 3, canonical_parsing = false, weighted = false
reading file '../data/unitigs_stitched/human.k63.unitigs.fa.ust.fa.gz'...
m_buffer_size 29411764
read 100000 sequences, 105696573 bases, 99496635 kmers
read 200000 sequences, 214697764 bases, 202297826 kmers
read 300000 sequences, 314555962 bases, 295956024 kmers
read 400000 sequences, 417096019 bases, 392296081 kmers
sorting buffer...
saving to file './sshash.tmp.run_1706705954074597201.minimizers.0.bin'...
read 500000 sequences, 531577423 bases, 500577485 kmers
read 600000 sequences, 641881766 bases, 604681828 kmers
read 700000 sequences, 753533625 bases, 710133687 kmers
read 800000 sequences, 860572728 bases, 810972790 kmers
read 900000 sequences, 966201746 bases, 910401808 kmers
sorting buffer...
saving to file './sshash.tmp.run_1706705954074597201.minimizers.1.bin'...
read 1000000 sequences, 1074799974 bases, 1012800036 kmers
read 1100000 sequences, 1177786303 bases, 1109586365 kmers
read 1200000 sequences, 1284445481 bases, 1210045543 kmers
read 1300000 sequences, 1387959208 bases, 1307359270 kmers
read 1400000 sequences, 1496344159 bases, 1409544221 kmers
sorting buffer...
saving to file './sshash.tmp.run_1706705954074597201.minimizers.2.bin'...
read 1500000 sequences, 1601642232 bases, 1508642294 kmers
read 1600000 sequences, 1704579905 bases, 1605379967 kmers
read 1700000 sequences, 1809734567 bases, 1704334629 kmers
read 1800000 sequences, 1913668987 bases, 1802069049 kmers
read 1900000 sequences, 2015410398 bases, 1897610460 kmers
sorting buffer...
saving to file './sshash.tmp.run_1706705954074597201.minimizers.3.bin'...
read 2000000 sequences, 2111050791 bases, 1987050853 kmers
read 2100000 sequences, 2208479896 bases, 2078279958 kmers
read 2200000 sequences, 2297838310 bases, 2161438372 kmers
read 2300000 sequences, 2384042166 bases, 2241442228 kmers
read 2400000 sequences, 2476010359 bases, 2327210421 kmers
read 2500000 sequences, 2554605267 bases, 2399605329 kmers
sorting buffer...
saving to file './sshash.tmp.run_1706705954074597201.minimizers.4.bin'...
read 2600000 sequences, 2632827152 bases, 2471627214 kmers
read 2700000 sequences, 2705855836 bases, 2538455898 kmers
read 2800000 sequences, 2777866668 bases, 2604266730 kmers
read 2900000 sequences, 2846376916 bases, 2666576978 kmers
read 3000000 sequences, 2913930048 bases, 2727930110 kmers
sorting buffer...
saving to file './sshash.tmp.run_1706705954074597201.minimizers.5.bin'...
read 3079563 sequences, 2961741299 bases, 2770808393 kmers
num_kmers 2770808393
num_super_kmers 165899482
num_pieces 3079564 (+0.137818 [bits/kmer])
=== step 1: 'parse_file' 646.326 [sec] (233.263 [ns/kmer])
== files to merge = 6
num_written_tuples = 50000000
num_written_tuples = 100000000
num_written_tuples = 150000000
num_written_tuples = 165899482
=== step 2: 'build_minimizers' 104.043 [sec] (37.5497 [ns/kmer])
bits_per_offset = ceil(log2(2961741362)) = 32
m_buffer_size 20833333
sorting buffer...
saving to file './sshash.tmp.run_1706706704653830813.bucket_pairs.0.bin'...
num_singletons 147668623/150850485 (97.8907%)
=== step 3: 'build_index' 123.475 [sec] (44.5628 [ns/kmer])
max_num_super_kmers_in_bucket 36004
log2_max_num_super_kmers_in_bucket 16
num_buckets_in_skew_index 23561/150850485(0.0156188%)
num_partitions 7
computing partitions...
num_kmers belonging to buckets of size > 64 and <= 128: 17348731
num_kmers belonging to buckets of size > 128 and <= 256: 14340210
num_kmers belonging to buckets of size > 256 and <= 512: 11881499
num_kmers belonging to buckets of size > 512 and <= 1024: 11484349
num_kmers belonging to buckets of size > 1024 and <= 2048: 8910908
num_kmers belonging to buckets of size > 2048 and <= 4096: 6277830
num_kmers belonging to buckets of size > 4096 and <= 36004: 10551588
num_kmers_in_skew_index 80795115(2.91594%)
building PTHash mphfs (with 8 threads) and positions...
lower 64; upper 128; num_bits_per_pos 7; keys_in_partition.size() 17348731
built mphs[0] for 17348731 keys; bits/key = 2.3249
built positions[0] for 17348731 keys; bits/key = 7.00002
lower 128; upper 256; num_bits_per_pos 8; keys_in_partition.size() 14340210
built mphs[1] for 14340210 keys; bits/key = 2.34678
built positions[1] for 14340210 keys; bits/key = 8.00003
lower 256; upper 512; num_bits_per_pos 9; keys_in_partition.size() 11881499
built mphs[2] for 11881499 keys; bits/key = 2.24441
built positions[2] for 11881499 keys; bits/key = 9.00003
lower 512; upper 1024; num_bits_per_pos 10; keys_in_partition.size() 11484349
built mphs[3] for 11484349 keys; bits/key = 2.2486
built positions[3] for 11484349 keys; bits/key = 10
lower 1024; upper 2048; num_bits_per_pos 11; keys_in_partition.size() 8910908
built mphs[4] for 8910908 keys; bits/key = 2.28181
built positions[4] for 8910908 keys; bits/key = 11
lower 2048; upper 4096; num_bits_per_pos 12; keys_in_partition.size() 6277830
built mphs[5] for 6277830 keys; bits/key = 2.32959
built positions[5] for 6277830 keys; bits/key = 12.0001
lower 4096; upper 36004; num_bits_per_pos 16; keys_in_partition.size() 10551588
built mphs[6] for 10551588 keys; bits/key = 2.25964
built positions[6] for 10551588 keys; bits/key = 16
num_bits_for_skew_index 985400144(0.355636 [bits/kmer])
=== step 4: 'build_skew_index' 120.959 [sec] (43.6548 [ns/kmer])
=== total_time 994.803 [sec] (359.03 [ns/kmer])
total index size: 1617.41 [MB]
SPACE BREAKDOWN:
minimizers: 0.154931 [bits/kmer]
pieces: 0.015002 [bits/kmer]
num_super_kmers_before_bucket: 0.0904985 [bits/kmer]
offsets: 1.91597 [bits/kmer]
strings: 2.13782 [bits/kmer]
skew_index: 0.355636 [bits/kmer]
weights: 5.31253e-07 [bits/kmer]
weight_interval_values: 9.23918e-08 [bits/kmer]
weight_interval_lengths: 3.46469e-07 [bits/kmer]
weight_dictionary: 9.23918e-08 [bits/kmer]
--------------
total: 4.66986 [bits/kmer]
avg_nanosec_per_positive_lookup 3659.03
avg_nanosec_per_negative_lookup 4382.01
avg_nanosec_per_positive_lookup_advanced 3074
avg_nanosec_per_negative_lookup_advanced 3624.09
avg_nanosec_per_access 1380.77
iterator: avg_nanosec_per_kmer 27.3282

I used the command ./sshash build -i ../data/unitigs_stitched/human.k63.unitigs.fa.ust.fa.gz -k 63 -m 31 --bench with sshash built in Release mode; the command was executed on an AMD EPYC-Rome (15) @ 2.249GHz.
Hi @adamant-pwn, of course a 15X slowdown is not acceptable. I guess this comes from the uint256 type being bad rather than from your code, since it works very well for k <= 63. We have to come up with a more efficient way of handling larger kmers...
Hi @jermp, I spent a bit more time investigating possible ways to use a 256-bit type. At first, I made some changes to accommodate the use of an existing 256-bit integer type, but I ended up making a new structure, bitpack, instead. I think for really long numbers it would be more efficient to just store them in a plain array or vector rather than in a tree (as some bignum libraries do). I also removed any strings shorter than 127 characters from the input file, as they contain no kmers for k=127.

Bench log with bitpack<__uint128_t, 1> and k=127
k = 127, m = 31, seed = 1, l = 6, c = 3, canonical_parsing = false, weighted = false
reading file '../data/unitigs_stitched/human.k127.unitigs.fa.ust.fa.gz'...
m_buffer_size 29411764
read 100000 sequences, 182189115 bases, 169589241 kmers
read 200000 sequences, 353905567 bases, 328705693 kmers
read 300000 sequences, 535877434 bases, 498077560 kmers
read 400000 sequences, 717529176 bases, 667129302 kmers
read 500000 sequences, 894938693 bases, 831938819 kmers
read 600000 sequences, 1069155657 bases, 993555783 kmers
read 700000 sequences, 1238429639 bases, 1150229765 kmers
read 800000 sequences, 1407403279 bases, 1306603405 kmers
sorting buffer...
saving to file './sshash.tmp.run_1707337830575510935.minimizers.0.bin'...
read 900000 sequences, 1577475282 bases, 1464075408 kmers
read 1000000 sequences, 1743208932 bases, 1617209058 kmers
read 1100000 sequences, 1907995326 bases, 1769395452 kmers
read 1200000 sequences, 2065574063 bases, 1914374189 kmers
read 1300000 sequences, 2216324177 bases, 2052524303 kmers
read 1400000 sequences, 2362189594 bases, 2185789720 kmers
read 1500000 sequences, 2500616290 bases, 2311616416 kmers
read 1600000 sequences, 2626636583 bases, 2425036709 kmers
read 1700000 sequences, 2745079914 bases, 2530880040 kmers
sorting buffer...
saving to file './sshash.tmp.run_1707337830575510935.minimizers.1.bin'...
read 1797875 sequences, 2853028046 bases, 2626495796 kmers
num_kmers 2626495796
num_super_kmers 55359550
num_pieces 1797876 (+0.172498 [bits/kmer])
=== step 1: 'parse_file' 1869.21 [sec] (711.673 [ns/kmer])
== files to merge = 2
num_written_tuples = 50000000
num_written_tuples = 55359550
=== step 2: 'build_minimizers' 26.5036 [sec] (10.0909 [ns/kmer])
bits_per_offset = ceil(log2(2853028173)) = 32
m_buffer_size 20833333
sorting buffer...
saving to file './sshash.tmp.run_1707339726346319196.bucket_pairs.0.bin'...
num_singletons 50839844/51714801 (98.3081%)
=== step 3: 'build_index' 33.2175 [sec] (12.6471 [ns/kmer])
max_num_super_kmers_in_bucket 29204
log2_max_num_super_kmers_in_bucket 15
num_buckets_in_skew_index 5287/51714801(0.0102234%)
num_partitions 7
computing partitions...
num_kmers belonging to buckets of size > 64 and <= 128: 10007008
num_kmers belonging to buckets of size > 128 and <= 256: 7471182
num_kmers belonging to buckets of size > 256 and <= 512: 6508079
num_kmers belonging to buckets of size > 512 and <= 1024: 5386165
num_kmers belonging to buckets of size > 1024 and <= 2048: 3677016
num_kmers belonging to buckets of size > 2048 and <= 4096: 2186595
num_kmers belonging to buckets of size > 4096 and <= 29204: 4977749
num_kmers_in_skew_index 40213794(1.53108%)
building PTHash mphfs (with 8 threads) and positions...
lower 64; upper 128; num_bits_per_pos 7; keys_in_partition.size() 10007008
built mphs[0] for 10007008 keys; bits/key = 2.26646
built positions[0] for 10007008 keys; bits/key = 7.00004
lower 128; upper 256; num_bits_per_pos 8; keys_in_partition.size() 7471182
built mphs[1] for 7471182 keys; bits/key = 2.30375
built positions[1] for 7471182 keys; bits/key = 8.00004
lower 256; upper 512; num_bits_per_pos 9; keys_in_partition.size() 6508079
built mphs[2] for 6508079 keys; bits/key = 2.32488
built positions[2] for 6508079 keys; bits/key = 9.00005
lower 512; upper 1024; num_bits_per_pos 10; keys_in_partition.size() 5386165
built mphs[3] for 5386165 keys; bits/key = 2.35027
built positions[3] for 5386165 keys; bits/key = 10.0001
lower 1024; upper 2048; num_bits_per_pos 11; keys_in_partition.size() 3677016
built mphs[4] for 3677016 keys; bits/key = 2.26738
built positions[4] for 3677016 keys; bits/key = 11.0001
lower 2048; upper 4096; num_bits_per_pos 12; keys_in_partition.size() 2186595
built mphs[5] for 2186595 keys; bits/key = 2.34717
built positions[5] for 2186595 keys; bits/key = 12.0002
lower 4096; upper 29204; num_bits_per_pos 15; keys_in_partition.size() 4977749
built mphs[6] for 4977749 keys; bits/key = 2.36104
built positions[6] for 4977749 keys; bits/key = 15.0001
num_bits_for_skew_index 476511808(0.181425 [bits/kmer])
=== step 4: 'build_skew_index' 49.42 [sec] (18.816 [ns/kmer])
=== total_time 1978.35 [sec] (753.226 [ns/kmer])
total index size: 1026.85 [MB]
SPACE BREAKDOWN:
minimizers: 0.0575382 [bits/kmer]
pieces: 0.00957263 [bits/kmer]
num_super_kmers_before_bucket: 0.0321532 [bits/kmer]
offsets: 0.674475 [bits/kmer]
strings: 2.1725 [bits/kmer]
skew_index: 0.181425 [bits/kmer]
weights: 5.60443e-07 [bits/kmer]
weight_interval_values: 9.74683e-08 [bits/kmer]
weight_interval_lengths: 3.65506e-07 [bits/kmer]
weight_dictionary: 9.74683e-08 [bits/kmer]
--------------
total: 3.12766 [bits/kmer]
avg_nanosec_per_positive_lookup 4598.98
avg_nanosec_per_negative_lookup 5773.25
avg_nanosec_per_positive_lookup_advanced 4263.72
avg_nanosec_per_negative_lookup_advanced 5329.6
avg_nanosec_per_access 1585.56
iterator: avg_nanosec_per_kmer 38.8011

So, while there is some slowdown, I think it's reasonable this time, as there is no built-in support for 256-bit integers and we have to emulate them with 128-bit integers. Also, the size is reduced quite significantly, from 1617.41 [MB] (4.67 bits/kmer) to 1026.85 [MB] (3.13 bits/kmer), which should also make up for the slowdown per kmer. I also tried running it with
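For illustration, here is a minimal sketch (my own, not the PR's actual bitpack implementation) of the general idea of emulating a wider unsigned integer with a plain array of words; the name wide_uint and its interface are hypothetical, and only the two operations needed to append one character to a k-mer are shown.

```cpp
#include <array>
#include <cstdint>

// Hypothetical sketch of a fixed-width unsigned integer stored as a flat
// array of words (least-significant word first). Only the two operations
// needed to append a character to a k-mer are shown.
template <typename Word, unsigned NumWords>
struct wide_uint {
    static constexpr unsigned word_bits = 8 * sizeof(Word);
    std::array<Word, NumWords> w{};

    // Shift left by s bits, assuming 0 < s < word_bits.
    wide_uint& operator<<=(unsigned s) {
        for (unsigned i = NumWords; i-- > 1;)
            w[i] = (w[i] << s) | (w[i - 1] >> (word_bits - s));
        w[0] <<= s;
        return *this;
    }

    // OR a small value into the least-significant word.
    wide_uint& operator|=(Word v) {
        w[0] |= v;
        return *this;
    }
};

// 256 bits from four 64-bit words: enough for k up to 127 at 2 bits per character.
using kmer256 = wide_uint<uint64_t, 4>;

int main() {
    kmer256 km;
    for (uint64_t c : {0ull, 1ull, 2ull, 3ull}) {  // append A, C, G, T
        km <<= 2;
        km |= c;
    }
    return km.w[0] == 0x1b ? 0 : 1;  // 0b00'01'10'11 == 0x1b
}
```

A real k-mer type would of course also need right shifts, comparisons, reverse complement, hashing, and so on, but the point is that a flat array of words keeps all of these as short, branch-free loops.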
Hi @adamant-pwn, very nice space result: increasing k improves the space of SSHash. I wonder where it goes in the limit :) Although storing a human genome in 3.13 bits/kmer is remarkable, I do not think the experiment is truly correct, since the input unitigs were built for a dBG of order 63, not 127. Thank you very much!
Thanks for the quick response! Sure, the space result looks good. And I do hope that it is genuine and not the result of some bug in the code that misses or fails to account for something when k goes beyond 63 🥲 Regarding testing on unitigs for k=127, I assume the appropriate raw data is Homo_sapiens.GRCh38.dna.toplevel.fa.gz?
I think there must necessarily be some errors when parsing a dBG of order 63 with k=127, but the space result should be quite close. Yes, that is the raw data: Homo_sapiens.GRCh38.dna.toplevel.fa.gz. I would suggest using GGCAT with the --eulertigs flag.
Ok, below is the log on the "toplevel" dataset. I prepared the input the same way as described in the README.

Bench log on the "toplevel" human DNA, k=127
k = 127, m = 31, seed = 1, l = 6, c = 3, canonical_parsing = false, weighted = false
reading file '../data/unitigs_stitched/Homo_sapiens.GRCh38.dna.toplevel.fa.unitigs.fa.ust.fa.gz'...
m_buffer_size 29411764
read 100000 sequences, 641461998 bases, 628862124 kmers
read 200000 sequences, 1299384399 bases, 1274184525 kmers
sorting buffer...
saving to file './sshash.tmp.run_1707385092832615700.minimizers.0.bin'...
read 300000 sequences, 1765189961 bases, 1727390087 kmers
read 400000 sequences, 2233033487 bases, 2182633613 kmers
read 500000 sequences, 2631516533 bases, 2568516659 kmers
read 600000 sequences, 2921679399 bases, 2846079525 kmers
sorting buffer...
saving to file './sshash.tmp.run_1707385092832615700.minimizers.1.bin'...
sorting buffer...
saving to file './sshash.tmp.run_1707385092832615700.minimizers.2.bin'...
read 608208 sequences, 2934546233 bases, 2857912025 kmers
num_kmers 2857912025
num_super_kmers 58916081
num_pieces 608209 (+0.0536296 [bits/kmer])
=== step 1: 'parse_file' 2035.8 [sec] (712.338 [ns/kmer])
== files to merge = 3
num_written_tuples = 50000000
num_written_tuples = 58916081
=== step 2: 'build_minimizers' 35.5332 [sec] (12.4333 [ns/kmer])
bits_per_offset = ceil(log2(2934546360)) = 32
m_buffer_size 20833333
sorting buffer...
saving to file './sshash.tmp.run_1707387164255862420.bucket_pairs.0.bin'...
num_singletons 51221382/52580188 (97.4157%)
=== step 3: 'build_index' 43.1616 [sec] (15.1025 [ns/kmer])
max_num_super_kmers_in_bucket 28977
log2_max_num_super_kmers_in_bucket 15
num_buckets_in_skew_index 9426/52580188(0.0179269%)
num_partitions 7
computing partitions...
num_kmers belonging to buckets of size > 64 and <= 128: 21978798
num_kmers belonging to buckets of size > 128 and <= 256: 17792929
num_kmers belonging to buckets of size > 256 and <= 512: 16696090
num_kmers belonging to buckets of size > 512 and <= 1024: 18053731
num_kmers belonging to buckets of size > 1024 and <= 2048: 17294062
num_kmers belonging to buckets of size > 2048 and <= 4096: 14098609
num_kmers belonging to buckets of size > 4096 and <= 28977: 20797100
num_kmers_in_skew_index 126711319(4.4337%)
building PTHash mphfs (with 8 threads) and positions...
lower 64; upper 128; num_bits_per_pos 7; keys_in_partition.size() 21978798
built mphs[0] for 21978798 keys; bits/key = 2.2951
built positions[0] for 21978798 keys; bits/key = 7.00002
lower 128; upper 256; num_bits_per_pos 8; keys_in_partition.size() 17792929
built mphs[1] for 17792929 keys; bits/key = 2.32146
built positions[1] for 17792929 keys; bits/key = 8.00002
lower 256; upper 512; num_bits_per_pos 9; keys_in_partition.size() 16696090
built mphs[2] for 16696090 keys; bits/key = 2.32747
built positions[2] for 16696090 keys; bits/key = 9.00002
lower 512; upper 1024; num_bits_per_pos 10; keys_in_partition.size() 18053731
built mphs[3] for 18053731 keys; bits/key = 2.31951
built positions[3] for 18053731 keys; bits/key = 10
lower 1024; upper 2048; num_bits_per_pos 11; keys_in_partition.size() 17294062
built mphs[4] for 17294062 keys; bits/key = 2.32405
built positions[4] for 17294062 keys; bits/key = 11
lower 2048; upper 4096; num_bits_per_pos 12; keys_in_partition.size() 14098609
built mphs[5] for 14098609 keys; bits/key = 2.35036
built positions[5] for 14098609 keys; bits/key = 12
lower 4096; upper 28977; num_bits_per_pos 15; keys_in_partition.size() 20797100
built mphs[6] for 20797100 keys; bits/key = 2.30088
built positions[6] for 20797100 keys; bits/key = 15
num_bits_for_skew_index 1592039440(0.557064 [bits/kmer])
=== step 4: 'build_skew_index' 209.907 [sec] (73.4478 [ns/kmer])
=== total_time 2324.4 [sec] (813.321 [ns/kmer])
total index size: 1199.73 [MB]
SPACE BREAKDOWN:
minimizers: 0.0537208 [bits/kmer]
pieces: 0.00327841 [bits/kmer]
num_super_kmers_before_bucket: 0.0309644 [bits/kmer]
offsets: 0.659683 [bits/kmer]
strings: 2.05363 [bits/kmer]
skew_index: 0.557064 [bits/kmer]
weights: 5.15061e-07 [bits/kmer]
weight_interval_values: 8.95759e-08 [bits/kmer]
weight_interval_lengths: 3.3591e-07 [bits/kmer]
weight_dictionary: 8.95759e-08 [bits/kmer]
--------------
total: 3.35834 [bits/kmer]
avg_nanosec_per_positive_lookup 4558.75
avg_nanosec_per_negative_lookup 5392.28
avg_nanosec_per_positive_lookup_advanced 4565.12
avg_nanosec_per_negative_lookup_advanced 5390.89
avg_nanosec_per_access 1563.39
iterator: avg_nanosec_per_kmer 34.9164

To summarize, the total size is now 1199.73 [MB] at 3.35834 [bits/kmer]. Sorry, I had already started BCALM2+UST yesterday and it took some time to construct, so I didn't have an opportunity to try GGCAT yet. If it's still relevant, could you please provide the specific commands I should use to build the input dataset with GGCAT?
Hi @adamant-pwn, ok that makes sense! I think this is the correct result. I would still perform a check to see if all kmers are found, etc. (just add --check at index construction, as you know). Thanks! For GGCAT, it should be very easy to use: just run the construction (no colored dBG needed, so without option -c) with -k 127 and --eulertigs.
I'm sorry, I'm still a bit confused as to what exactly I have to do with ggcat. I built ggcat and tried the following commands:

$ gzip -d Homo_sapiens.GRCh38.dna.toplevel.fa.gz
$ ggcat build -k 127 -j 8 --eulertigs Homo_sapiens.GRCh38.dna.toplevel.fa
...
Compacted De Bruijn graph construction completed.
TOTAL TIME: 1387.88s
...
Final output saved to: output.fasta.lz4
$ lz4 -d output.fasta.lz4
Decoding file output.fasta
Decompressed : 135 MB
Error 68 : Unfinished stream
$ ./sshash build -i ../data/unitigs_stitched/output.fasta -k 127 -m 31
...
total index size: 70.1673 [MB]
SPACE BREAKDOWN:
minimizers: 0.058696 [bits/kmer]
pieces: 0.0261435 [bits/kmer]
num_super_kmers_before_bucket: 0.032964 [bits/kmer]
offsets: 0.628415 [bits/kmer]
strings: 2.53139 [bits/kmer]
skew_index: 0.51633 [bits/kmer]
weights: 9.9489e-06 [bits/kmer]
weight_interval_values: 1.73024e-06 [bits/kmer]
weight_interval_lengths: 6.48841e-06 [bits/kmer]
weight_dictionary: 1.73024e-06 [bits/kmer]
--------------
total: 3.79395 [bits/kmer]

Clearly, something goes wrong here, but I can't be sure what exactly. The error during the decompression... Running...
I think you did everything right. The output of GGCAT is some lz file that should be decompressed and given to SSHash as input. The point is that there must be a bug in GGCAT, since the file did not decompress without errors. Can you check whether GGCAT works fine with some other, smaller file? If so, then it is definitely a bug in GGCAT occurring for larger files.
The error actually persists even with smaller datasets:

$ gzip -d se.ust.k31.fa.gz
$ ggcat build -k 31 -j 8 --eulertigs se.ust.k31.fa
...
Final output saved to: output.fasta.lz4
$ lz4 -d output.fasta.lz4
Decoding file output.fasta
Error 68 : Unfinished stream
I opened a new issue about this with ggcat (algbio/ggcat#42), but maybe I'm just misusing ggcat somehow 🤔
Thanks. I'll try it myself later and see what happens. Perhaps it's just the --eulertigs flag?
I think I tried without it as well and got the same results. But it would surely be nice to verify independently.
In the meantime, here is the full log with --check.
Beautiful, thanks Oleksandr!
Hi Giulio! I was on vacation for the last two weeks, but am now back and ready to work further on this. That is, if further work is required 🙂 I saw that you accepted a pull request from Harun that creates a slight merge conflict with this PR. I suppose I should resolve it now, which I'll try to do soon. Are there any other actions related to this PR that you'd like me to take?
fix warning when printing uint128_t
Hi @jermp, any updates on merging this (and also on the communications re: preparing a new paper)? 👀 I added two minor pull requests from Harun to our fork; they are very straightforward and just fix some warnings. But I expect that some further work might be needed in the future for the integration of MetaGraph with SSHash, and it'd be easier if the current PR were merged, so that further changes in our fork won't interfere unduly with the PR.
Hi @adamant-pwn, I've been busy with other work recently and didn't have time to start benchmarking this. Best,
Authored-by: Harun Mustafa <[email protected]>
Hi @adamant-pwn, After this PR is merged into the
If you're interested in this future development, just let me know.
@NPSDC --- we should try and get our PR in before this is ultimately merged, so that we can benefit from these updates!
Hi @jermp, thanks for the update! I merged your latest changes to master into the pull request, and also changed the target branch to
Sure! I think generally contributions to the project should be governed by one license. SSHash is already licensed under the MIT License; I think it's fine to keep using it, but it'd make sense to update the list of authors.
Sure, please let me know if there is anything in particular that I can help with.
Hi @jermp, this sounds great! I was wondering if we can figure out a merge order between this PR and ours. Thanks!
Hi @adamant-pwn and @rob-p, and thank you for your prompt interactions. @adamant-pwn: thanks for the rebase. You'll definitely feature as one of the authors, as well as Rob and all the other people involved, in an SSHash follow-up paper :) @rob-p: yes, you're right. I think it would probably be more difficult to include all of Oleksandr's changes into submodule-integration-for-piscem than the other way round. I just opened the PR from that branch. (Or is @adamant-pwn willing to also incorporate those changes into his own PR?! :D)
@rob-p, let me know if the above plan makes sense to you. I think merging this PR first into
Hi @jermp --- that seems good to me! As I said, I think that Noor's changes are quite small and orthogonal to Oleksandr's, so that should hopefully be easy (famous last words). Let me know if you need any help with the merge at any point.
Ok perfect; thanks @rob-p for confirming. So, I'll merge this PR first and then integrate Noor's changes there!
Woot. Congrats @adamant-pwn and @jermp on the first major steps toward sshash 2!
All credit so far goes to @adamant-pwn! Thank you again for this.
Hi @jermp! This is a continuation of my work that I mentioned in #35.

With this change, I rewrite the current implementation of kmer_t into a full-fledged structure that handles bit manipulations over kmers. I think this should help greatly to compartmentalize the current code that deals with the alphabet characters <-> bits conversion; it also provides better semantics to some existing pieces of code, potentially improving their readability, and allows using it with other alphabets.

I tried my best to avoid performance sacrifices while making it (hopefully) generic enough for the needs of MetaGraph. As such, the new kmer type still assumes that the underlying storage for kmers is an unsigned integer with a number of bits divisible by 64, but it now supports arbitrarily large integer types (so 128-bit integers, or custom types with e.g. 256 bits, should also work). To accommodate larger alphabets, I still stick to the assumption that each alphabet character uses a fixed number of bits, but that number is now a template parameter rather than the hardcoded number 2. I tried to highlight this in all the places where it was previously used as a hard-coded value.

I also compared the two implementations using the first example command from the README file. You may see the resulting stats below. I guess there is some slight slowdown, especially if you change to __uint128_t, but it also seems to me that the slowdown is not very significant compared to the generality that this approach brings.

And, of course, I also ran all the checks in debug mode, both on the example query below and on the datasets se.ust.k47.fa.gz and se.ust.k63.fa.gz. If you think that this change aligns with your vision for SSHash, I would be happy if you could review it, so that it can potentially be merged in the future.
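To make the description above concrete, here is a rough sketch of the interface shape being described (my own illustration, not the PR's actual kmer_t): the storage type and the number of bits per character are template parameters, and the alphabet-specific logic is confined to a couple of small methods.

```cpp
#include <cstdint>

// Hypothetical sketch of a generic k-mer container: StorageUint is the
// underlying unsigned integer (uint64_t, __uint128_t, or a wider emulated
// type), and BitsPerChar replaces the previously hardcoded 2.
template <typename StorageUint, unsigned BitsPerChar>
struct kmer {
    static_assert((8 * sizeof(StorageUint)) % 64 == 0,
                  "storage width is assumed to be a multiple of 64 bits");
    static constexpr StorageUint char_mask = (StorageUint(1) << BitsPerChar) - 1;

    StorageUint bits = 0;

    // Append one character code at the low end.
    void push_char(uint64_t c) {
        bits = (bits << BitsPerChar) | (StorageUint(c) & char_mask);
    }

    // Read back the i-th character code, counting from the last one pushed.
    uint64_t char_at(unsigned i) const {
        return uint64_t((bits >> (i * BitsPerChar)) & char_mask);
    }
};

// DNA with the usual 2-bit encoding.
using kmer64 = kmer<uint64_t, 2>;      // k <= 31
#if defined(__SIZEOF_INT128__)
using kmer128 = kmer<__uint128_t, 2>;  // k <= 63 (GCC/Clang extension)
#endif

int main() {
    kmer64 km;
    km.push_char(0);  // A
    km.push_char(2);  // G
    return (km.char_at(0) == 2 && km.char_at(1) == 0) ? 0 : 1;
}
```

A larger alphabet (e.g. amino acids) would only change BitsPerChar, while a larger k would only change StorageUint, which is the separation of concerns the PR aims for.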